Introduction

“Provide background on your data sets and a clear formulated question or hypothesis.”

For this project, my question of Interest is: “How does crime rate relate to poverty in Canada?”

In order to answer this question, data on both crime and socioeconomic status are needed. However, I found no existing data set that contains all desired information, therefore this needs to be achieved through merging more than one data sets. Aftering choosing carefully, the following two separate data sets are obtained:

  1. “Income of individuals by age group, sex and income source, Canada, provinces and selected census metropolitan areas”. Released 2023-05-02. This data set is annually updated and maintained by Statistics Canada (Table 11-10-0239-01). Data is collected through the Survey of Labor and Income Dynamics, Survey of Consumer Finances, and Canadian Income Survey.

  2. “Incident-based crime statistics, by detailed violations, Canada, provinces, territories, Census Metropolitan Areas and Canadian Forces Military Police”. Released 2023-07-27. This data set is also annually updated and maintained by Statistics Canada (Table 35-10-0177-01, formerly CANSIM 252-0051). Data is collected through the Uniform Crime Reporting Survey.

Understanding the relationship between crime rates and poverty in Canada is crucial for policymakers, law enforcement agencies, and social welfare programs. Exploring this correlation can shed light on the socioeconomic factors driving criminal behavior and help formulate targeted interventions to alleviate poverty and reduce crime. Additionally, elucidating this connection can inform broader discussions on social inequality, justice, and community well-being in Canadian society.

Methods

“Include how and where the data were acquired, how you cleaned and wrangled the data, what tools you used for data exploration.”

Both data sets are downloaded directly from Statistics Canada, which is usually considered to be an reliable source. Because they share the same source, the data sets follows similar structure and all contains the two columns GEO and REF_DATE where the former one refers to the geographical region and the second one refers to the year of data. Thus, it’s possible to combine the two data sets to obtain all information needed.

However, it is worth mentioning that both data sets are huge and contains unrelated information. Therefore, cleaning and wrangling are needed for more convenient analysis and more efficient computing & uploading, as the original data sets are oversize thus cannot be pushed to github repository.

Reference:

  1. The census data set: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1110023901
  2. The crime data set: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3510017701
fn1 <- "https://raw.githubusercontent.com/inorrr/JSC370_project/main/census.csv"
fn2 <- "https://raw.githubusercontent.com/inorrr/JSC370_project/main/crime.csv"

if (!file.exists("census.csv"))
  download.file(fn1, destfile = "census.csv")
census_df <- data.table::fread("census.csv")

if (!file.exists("crime.csv"))
  download.file(fn2, destfile = "crime.csv")
crime_df <- data.table::fread("crime.csv")

Data Wrangling and Cleaning

  1. First I dropped the unrelated columns in both columns, keeping only information revelant to the question of interest.
crime_df <- crime_df[, c("REF_DATE", "GEO", "Violations", "Statistics", "VALUE", "UOM")]
census_df <- census_df[, c("REF_DATE", "GEO", "Age group", "Sex", "Income source", "Statistics", "VALUE", "UOM", "SCALAR_FACTOR")]
  1. I noticed that in the GEO column:
  • both data sets uses a mix of province & city names. I removed all observations with city names and keep only those province data so that the data set is smaller(since the province data contains cities already) and avoids redundancy.
  • Notice that the census data only have data from the 10 provinces, not the 3 territories, I filtered out those data from the crime data set as there are no matching data.
  • Notice that there is a square bracket with a number behind each province name in the crime data, therefore I removed it with regular expression so that we can join the two data sets on province later.
table(census_df$GEO)

provinces1 <- c("Alberta [48]", "British Columbia [59]", "Manitoba [46]", "New Brunswick [13]", 
               "Newfoundland and Labrador [10]", "Saskatchewan [47]",
               "Nova Scotia [12]", "Ontario [35]", 
               "Prince Edward Island [11]", "Quebec [24]")

provinces2 <- c("Alberta", "British Columbia", "Manitoba", "New Brunswick", 
               "Newfoundland and Labrador", "Saskatchewan","Nova Scotia", 
               "Ontario", "Prince Edward Island", "Quebec")

crime_df <- crime_df[crime_df$GEO %in% provinces1, ]
census_df <- census_df[census_df$GEO %in% provinces2, ]

crime_df$GEO <- gsub("\\s*\\[\\d+\\]$", "", crime_df$GEO)
table(crime_df$GEO)
  1. In the Statistics column of the crime data, there are many measures related to crime, but I’m only interested in the number of incidents (Actual incidents) and crime rate (Rate per 100,000 population), thus statistics related to charges are removed. The Crime Severity Index(Percentage contribution to the Crime Severity Index (CSI)) seems to be interesting and thus is kept.
crime_df <- crime_df %>% filter(Statistics == "Actual incidents" | 
                                Statistics == "Rate per 100,000 population" | 
                                Statistics == "Percentage contribution to the Crime Severity Index (CSI)")
  1. In the crime data frame,
  • there are 314 different types of crime, which does much more detailed categorization than I’m interested in. Therefore, to avoid having too much computation to merge data later on, I choose too keep only the bigger categories (i.e. total robbery, total assaults, etc).
  • There are also square brackets at the end so I removed them.
print(length(unique(crime_df$Violations)))
table(crime_df$Violations)

# Identify rows that start with "Total"
total_rows <- grepl("^Total", crime_df$Violations)

# Subset the dataframe to keep only the rows starting with "Total"
crime_df <- crime_df[total_rows, , drop = FALSE]

# Remove square brackets and numbers at the end
crime_df$Violations <- gsub("\\s*\\[\\d+\\]$", "", crime_df$Violations)
  1. In the census data frame:
  • the column Age group specifies the age however since we do not have this information in the crime df, we need to combine all age groups. This can be done by taking the average of the categories.

  • Same for sex, same method is used.

table(census_df$"Age group")
table(census_df$"Sex")

# first we merge the age group categories
census_df <- census_df %>% 
  group_by(REF_DATE, GEO, Sex, `Income source`, Statistics, UOM, SCALAR_FACTOR) %>% 
  summarise(VALUE = mean(VALUE, na.rm = TRUE))

# next we merge the age group categories
census_df <- census_df_new %>% 
  group_by(REF_DATE, GEO, `Income source`, Statistics, UOM, SCALAR_FACTOR) %>% 
  summarise(VALUE = mean(VALUE, na.rm = TRUE))
  1. I keep only the data between 1998 and 2021, as that’s the year range where the two data sets overlap.
crime_df <- crime_df %>% filter(REF_DATE >= 1998 & REF_DATE <= 2021)
census_df <- census_df %>% filter(REF_DATE >= 1998 & REF_DATE <= 2021)
  1. Write the cleaned data frame to CSV files.
write.csv(crime_df, "/Users/yinuozhao/Desktop/UofT/JSC370/JSC370-2024-main/JSC370_project/crime.csv")
write.csv(census_df, "/Users/yinuozhao/Desktop/UofT/JSC370/JSC370-2024-main/JSC370_project/census.csv")

At this point both the crime data frame and census data frame has REF_DATE and GEO in common, and they each have another categorical variable, which is Income source for census data and Violation(it means crime type) for crime data. While it may seem to make sense to join the two data sets using REF_DATE and GEO directly, the results would involves the data for all combinations of Income source and Violation for each REF_DATE and GEO. This will be a huge data set and thus slow down the computation. Therefore, I choose to keep the data sets separate and join them when necessary (i.e. after picking out certain categories of interest).

Exploratory Data Analysis

Notice that right now both data sets are in long format, I converted them to wide for convenience.

crime_df <- pivot_wider(crime_df, id_cols = c(REF_DATE, GEO, Violations), 
                        names_from = Statistics, values_from = VALUE)
crime_df <- na.omit(crime_df)

census_df <- pivot_wider(census_df, id_cols = c(REF_DATE, GEO, `Income source`), 
                         names_from = Statistics, values_from = VALUE)
census_df <- na.omit(census_df)

Check the dimensions and headers and footers of the data

dim(census_df)
dim(crime_df)
head(crime_df)
tail(crime_df)
head(census_df)
tail(census_df)

The census data set has 8 variables with 3613 observations, the crime dataset has 6 variables with 9271 observations. By looking at the headers and footers of both data sets, they seems to be imported correctly and contains no missing values (in the displayed rows).

Check the variable types in the data

str(census_df)
str(crime_df)
summary(census_df)
summary(crime_df)

In both data frames, we see that the variable types are a mix of integer, numeric and characters. All variable types correctly align with the context of the variables. No major problems arises with the data at this stage (i.e. a variable with all missing values.)

Take a closer look at some/all of the variables

For both data frame, we need REF_DATE and GEO to correctly identify a province in Canada with a valid year. For census data frame, we need to look at the values of the different types of income (median, aggregate, etc). For the crime data frame, we need to look at the recorded crime rate and actual number of incidents to be within the reasonable range.

table(census_df$REF_DATE)
table(census_df$GEO)
table(crime_df$REF_DATE)
table(crime_df$GEO)
summary(census_df$`Aggregate income`)
summary(census_df$`Average income (excluding zeros)`)
summary(census_df$`Median income (excluding zeros)`)
summary(crime_df$`Actual incidents`)
summary(crime_df$`Rate per 100,000 population`)

Both data sets contains data from 1998 to 2021, on the 10 provinces in Canada as desired because I cleaned the data sets this way. Other variables being checked are within the reasonable range. The aggregate income, average income and median income are all measured in 2021 constant dollars, aggregate income record numbers in millions. The crime rates are measured as number of incidents per 100,000 population.

Validate with an external source

Notice that the minimum average income is 677.8, which seems to be much lower than then mean average income, even 10 times lower than the 1st quantile. Since it seems quite suspisous, we need to validate it.

census_df[which.min(census_df$`Average income (excluding zeros)`), ]

This data is from Prince Edward Island in 2004, and the income source is “other government transfers”. Upon research, Government transfers refers to assistance from provincial and municipal programs, Workers’ Compensation benefits, the GST/HST Credit and provincial refundable tax credits such as the Quebec and Newfoundland and Labrador sales tax credits. However, since many of the above mentioned are made to their own category and excluded from “other government transfers” in the data set, it make sense that the value is low.

Preliminary Results

Provide summary statistics in tabular from and publication-quality figures, take a look at the kable function from knitr to write nice tables in Rmarkdown

  1. Firstly, I examined the trend in crime rate across provinces and crime types.
filtered_data = crime_df %>% filter(Violations=="Total, all violations")
unique_x_values <- unique(crime_df$REF_DATE)
ggplot(filtered_data, aes(x = REF_DATE, y = `Rate per 100,000 population`, color = GEO)) +
  geom_line() +
  labs(x = "Year", y = "Rate per 100,000 population", title = "Rates of Total Crime by Province") +
  scale_x_continuous(breaks = unique_x_values) + 
  scale_color_discrete(name = "Provinces") + 
  theme_linedraw() +
  theme(legend.position = "right") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  theme(plot.title = element_text(face = "bold")) + 
  theme(plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm")) + 
  theme(axis.text = element_text(size = 7)) + 
  theme(legend.title = element_text(face = "bold"))

The plot depicts the total crime rate, encompassing all types of crimes, from 1998 to 2021, with each province represented by a distinct color. The crime rate is measured per 100,000 population. Notably, Saskatchewan consistently exhibits a significantly higher crime rate compared to other provinces throughout the period of 1998 to 2021. Conversely, Quebec and Ontario consistently demonstrate the lowest crime rates. Across all provinces, there is a discernible decreasing trend in crime rates over the years, with many provinces experiencing peak crime rates in 2003-2004.

In addition to analyzing total crime, I delved into specific crime categories often associated with poverty: break and enter, robbery, and prostitution. The three accompanying plots illustrate their respective rates over the years. Overall, there is a decreasing trend in the rates of all three crimes, with occasional exceptions such as robbery rates in Manitoba. Notably, British Columbia stands out with a significantly high rate of prostitution in 2004, doubling the number reported in Saskatchewan, which held the second-highest rate that year.

Note that the 3 plots below shares the same legend with the above plots. Therefore the legend is omited for better display purpose. Codes are also not shown as they reuses the above code chunk.

Below is a table summarizing the average crime rate of all crime types (Total, all violations is excluded) by province and year. The cells are colored by value, where lighter color means higher value and darker color means smaller value. Scrolling down the table, we can see that for all provinces, the average crime rate has a decreasing trend as the colors are getting dark in each column. This agree with what I found and infered from the previous plots.

Table of Average Crime Rate Across All Types
Year Alberta British Columbia Manitoba New Brunswick Newfoundland and Labrador Nova Scotia Ontario Prince Edward Island Quebec Saskatchewan
1998 990.7389 1310.2795 1140.4897 746.1692 644.1653 875.0903 776.1408 686.4670 702.7424 1420.691
1999 996.9833 1243.8727 1139.4081 751.8170 616.3186 890.2262 705.8405 786.3608 649.5519 1372.294
2000 938.1432 1202.5849 1146.4238 719.6546 621.9546 807.6322 695.1511 739.1711 653.3770 1428.722
2001 979.9516 1233.4349 1213.3189 712.1708 617.2849 812.8562 672.0454 752.6111 634.2719 1518.370
2002 994.3392 1245.2322 1202.5438 726.1322 639.0897 816.2289 651.8003 850.6072 617.0995 1505.492
2003 1065.0392 1317.9670 1366.9247 758.0414 657.4530 893.6070 632.2778 915.8641 638.6003 1728.543
2004 1098.5319 1315.9251 1347.9319 780.0538 667.5708 915.0119 595.3678 893.0314 624.6361 1684.091
2005 1079.1975 1260.0459 1278.7128 692.0257 650.8968 859.9862 574.3292 812.8373 615.6142 1649.431
2006 1007.2778 1203.3584 1260.0861 648.6084 667.1994 856.8851 589.4051 737.7965 609.9108 1502.769
2007 1018.9281 1137.0500 1226.7060 608.8173 697.5675 803.5808 560.8689 670.0778 584.9914 1480.692
2008 984.9581 1046.3411 1081.2483 624.8186 696.9058 772.9158 537.7905 683.8757 589.6462 1412.581
2009 935.6630 983.8919 1153.1356 615.0719 716.9997 773.6461 521.8932 697.9243 578.2262 1403.483
2010 884.1551 933.4835 1043.1349 625.0964 725.7876 763.8995 496.5708 699.9311 562.8811 1432.178
2011 812.9859 880.2127 1012.0920 577.9611 705.3294 714.0051 468.0995 710.5243 519.5841 1408.674
2012 819.5778 856.2392 1000.7560 615.4689 696.8877 714.1444 449.8354 721.6681 510.7627 1311.701
2013 778.9997 803.0581 866.5694 538.5750 682.2809 639.5953 408.5759 669.5086 460.8854 1218.580
2014 783.3216 792.6345 840.0633 482.3608 636.2366 587.6563 391.6511 510.4745 421.5565 1155.519
2015 855.7324 827.4308 894.7058 527.3759 635.6389 551.9057 391.9432 451.0126 399.5150 1258.409
2016 869.6887 818.0165 956.7842 505.9946 630.6538 538.1619 399.0192 468.5732 405.6246 1317.321
2017 927.8811 781.4886 955.9997 549.3957 583.0170 549.4824 417.5035 439.5063 410.9189 1271.176
2018 886.8923 718.8339 912.0272 497.1263 502.9756 493.3066 400.3605 447.9967 339.5035 1188.911
2019 925.4612 796.9984 1016.7941 544.1014 553.7074 498.5129 404.5249 518.0077 326.3725 1151.130
2020 788.6307 735.6162 943.0979 564.7014 557.8670 491.2514 356.3917 446.2160 301.0057 1085.383
2021 730.3198 703.6836 839.5900 590.1656 602.4644 489.7486 360.7433 418.9484 312.5548 1106.370

Given the general decrease in crime rates, I am interested in exploring the trend of income to ascertain the potential existence of an association.

Table of Average Total Income
Year Alberta British Columbia Manitoba New Brunswick Newfoundland and Labrador Nova Scotia Ontario Prince Edward Island Quebec Saskatchewan
1998 44804.17 40441.67 38208.33 34125.00 30316.67 34320.83 45100.00 32633.33 37412.50 37370.83
1999 44237.50 41416.67 38075.00 35112.50 31783.33 36316.67 46620.83 33058.33 38520.83 38000.00
2000 45795.83 41462.50 38841.67 35654.17 32533.33 37070.83 48170.83 34225.00 39716.67 38579.17
2001 47845.83 41916.67 39650.00 36537.50 32600.00 37950.00 48279.17 34445.83 40520.83 40308.33
2002 46983.33 42695.83 39629.17 35870.83 33270.83 38487.50 48091.67 35020.83 40791.67 40150.00
2003 47920.83 41787.50 40129.17 36108.33 33125.00 37716.67 47812.50 35325.00 40570.83 40554.17
2004 49666.67 42941.67 41029.17 36433.33 33470.83 38187.50 48170.83 36070.83 41600.00 40345.83
2005 51758.33 43800.00 41700.00 36400.00 35012.50 39283.33 48395.83 36800.00 40804.17 42166.67
2006 53820.83 44679.17 42562.50 37479.17 36858.33 40408.33 47258.33 38000.00 41604.17 44904.17
2007 56879.17 45850.00 44633.33 39200.00 39304.17 41320.83 48079.17 38120.83 42441.67 47750.00
2008 57337.50 46858.33 45875.00 39520.83 40412.50 40933.33 49212.50 39500.00 41925.00 49100.00
2009 57237.50 45875.00 45437.50 40466.67 40500.00 42362.50 48404.17 39587.50 42545.83 50650.00
2010 57650.00 45400.00 45129.17 40470.83 42195.83 41579.17 48891.67 39841.67 42662.50 50337.50
2011 59145.83 45633.33 44687.50 41508.33 44045.83 42516.67 48133.33 41354.17 43833.33 52483.33
2012 62166.67 46400.00 45195.83 41458.33 46112.50 43358.33 48645.83 40562.50 44308.33 52687.50
2013 61962.50 47854.17 46837.50 42095.83 48583.33 45220.83 49600.00 42529.17 44716.67 53658.33
2014 63200.00 47929.17 47129.17 42545.83 49654.17 45504.17 50062.50 42658.33 45041.67 55975.00
2015 64020.83 47437.50 48054.17 42104.17 50033.33 45445.83 51083.33 43308.33 44354.17 54908.33
2016 57637.50 47354.17 47425.00 43383.33 48466.67 45908.33 51191.67 43366.67 45983.33 52816.67
2017 59825.00 50650.00 49529.17 44858.33 48633.33 45612.50 52683.33 44245.83 46412.50 53900.00
2018 59458.33 51012.50 48875.00 46133.33 49154.17 46508.33 52650.00 44962.50 47091.67 52412.50
2019 58787.50 51791.67 48016.67 45779.17 48695.83 46437.50 52162.50 44616.67 48854.17 51304.17
2020 57991.67 54508.33 50758.33 48325.00 50183.33 48558.33 55320.83 47558.33 51000.00 53500.00
2021 59270.83 55733.33 50300.00 48358.33 51308.33 49062.50 56441.67 47737.50 52195.83 53025.00

summary_stats <- joint_df %>% 
  group_by(GEO) %>% 
  summarise(Correlation = cor(`Average income (excluding zeros)`, `Rate per 100,000 population`))

summary_stats <- summary_stats %>% rename(Province = GEO)

summary_table <- summary_stats %>%
  kable("html") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE) %>%
  add_header_above(c("Correlation between Average Total Income and Total Crime Rate" = 2)) %>% 
  kable_styling(fixed_thead = T)
summary_table
Correlation between Average Total Income and Total Crime Rate
Province Correlation
Alberta -0.7023852
British Columbia -0.8154907
Manitoba -0.7392303
New Brunswick -0.5966958
Newfoundland and Labrador -0.0025743
Nova Scotia -0.9236636
Ontario -0.7583329
Prince Edward Island -0.8082244
Quebec -0.9447648
Saskatchewan -0.7695228

Pick a province to look closer

Summary

What you found so far from your data in terms of the formulated question.